This notebook leverages machine learning techniques to enable similarity search over audio files. The end goal is an effective audio similarity search algorithm that can return audio similar to a user-selected file. After reviewing the research, the following approach is attempted:
# Python built-ins
import time
import os
import random
# math/vector manipulation and plots
import numpy as np
import matplotlib.pyplot as plt
# image saving
from skimage import io
# audio processing library
import librosa
import librosa.display
# audio playback in notebook
import IPython.display as ipd
# Scikit Learn
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import cosine_similarity
# TensorFlow and Keras
import tensorflow as tf
from tensorflow import keras
#from tensorflow.keras import layers
#from tensorflow.keras.models import Model
#from tensorflow.keras.callbacks import TensorBoard
max_audio_duration = 2 # seconds
audio_samplerate = 22050 # Hz
n_fft = 4096 # FFT window size, in samples
fmin = 20 # min frequency (Hz) to consider in the mel spectrogram
fmax = audio_samplerate // 2 # max frequency (Hz) to consider in the mel spectrogram
input_image_size = (256, 256) # (width, height) pixels
image_value_type = np.uint16 # bit-depth of image, 16-bit seems to provide the best results
image_max_value = np.iinfo(image_value_type).max # used in scaling values 0-1
test_set_size = 0.2 # % of dataset to reserve for test set
trained_encoder_location = 'trained_encoders/ConvEncoder'
training_clip_paths = ['Drums', 'Bass', 'Keys']
inference_clip_paths = ['Drums', 'Bass', 'Keys']
def get_clip_paths(rel_paths):
    """
    Returns a list of full paths for all supported audio files under each path in rel_paths.
    """
    clip_paths = []
    for rel_path in rel_paths:
        audio_wav_dir = f'audio-data/{rel_path}'
        for root, dirs, files in os.walk(audio_wav_dir):
            for file in files:
                path = os.path.join(root, file)
                if path.endswith('.wav'):
                    clip_paths.append(path)
    return clip_paths
def get_audio_samples(path, min_samples):
    """
    Returns the samples for the audio file at path, padded to at least min_samples.
    """
    y, _ = librosa.load(path=path, mono=True, sr=audio_samplerate, offset=0, duration=max_audio_duration)
    # pad if needed, up to the n_fft used in mel calcs
    if len(y) < min_samples:
        y = np.pad(y, (0, min_samples - len(y)), 'constant')
    return y
def scale_minmax(X, max=image_max_value):
    """
    Normalizes values into the range 0 - max.
    """
    X_min = X.min()
    X_max = X.max()
    X_diff = X_max - X_min
    if X_diff == 0:
        return np.zeros_like(X)  # avoid division by zero on constant input
    return ((X - X_min) / X_diff) * max
def get_mel_spectrogram(y, sr, filename, save_file=False):
    """
    Returns mel spectrogram image data given audio data.
    """
    S = librosa.feature.melspectrogram(y=y,
                                       sr=sr,
                                       n_mels=input_image_size[1],
                                       n_fft=n_fft,
                                       hop_length=max(int(len(y) / input_image_size[0]), 1),
                                       fmin=fmin,
                                       fmax=fmax)
    # convert power spectrogram to decibel (log) scale
    img = librosa.power_to_db(S, ref=np.max).astype(np.float32)
    # discard extra spectrogram columns, pad missing ones
    if img.shape[1] > input_image_size[0]:
        img = img[:, :input_image_size[0]]
    elif img.shape[1] < input_image_size[0]:
        img = np.pad(img, [(0, 0), (0, input_image_size[0] - img.shape[1])], 'constant')
    # scale mel values to image values
    img = scale_minmax(img)
    # put low frequencies at the bottom of the image (typical human-readable format)
    img = np.flip(img, axis=0)
    img = img.astype(image_value_type)
    # save as PNG
    if save_file:
        png_path = filename.replace('.wav', '.png')
        print(f'Saving PNG: {png_path}')
        io.imsave(png_path, img)
    return img
def get_data(paths, save_images=False):
    """
    Get data from paths, prepared for training. Optionally save mel spectrogram image files.
    """
    audio_paths = get_clip_paths(paths)
    mels = []
    for path in audio_paths:
        samples = get_audio_samples(path, n_fft)
        mel = get_mel_spectrogram(samples, audio_samplerate, path, save_file=save_images)
        # add a color channel, as the model will expect this in the input shape, and append to the dataset
        mels.append(mel.reshape(input_image_size[1], input_image_size[0], 1))
    # scale values to 0-1
    X = np.array(mels)
    X = X / float(image_max_value)
    return audio_paths, X
Each image below is a Mel Spectrogram depicting an audio file. In these images, frequencies are represented on a log scale on the Y axis (lower frequencies on the bottom, higher frequencies on the top), across time on the X axis (scaled to the duration of audio imported, up to 2 seconds).
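The decibel conversion used when building these images (`librosa.power_to_db` with `ref=np.max`) can be approximated in plain NumPy. The sketch below is an illustration of that scaling on a made-up power matrix, not the author's exact pipeline:

```python
import numpy as np

def power_to_db_sketch(S, top_db=80.0, amin=1e-10):
    """Rough NumPy equivalent of librosa.power_to_db(S, ref=np.max):
    convert a power spectrogram to dB relative to its peak, clipped to top_db."""
    S = np.maximum(S, amin)                # avoid log of zero
    db = 10.0 * np.log10(S / S.max())      # dB relative to the maximum value
    return np.maximum(db, -top_db)         # limit dynamic range to top_db

S = np.array([[1.0, 0.5], [0.01, 1e-12]])  # toy power values
db = power_to_db_sketch(S)
print(db)  # peak maps to 0 dB, everything else is negative, floored at -80
```

This is why the brightest pixel in each spectrogram image corresponds to the loudest time-frequency bin, with everything else measured relative to it.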
Y, X = get_data(training_clip_paths, save_images=False)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_set_size, random_state=42)
X_train = np.reshape(X_train, (len(X_train), input_image_size[0], input_image_size[1], 1))
X_test = np.reshape(X_test, (len(X_test), input_image_size[0], input_image_size[1], 1))
The convolutional autoencoder below is built using the Keras functional API. It consists of a relatively simple Encoder and Decoder made of convolution and deconvolution (upsampling) layers. As configured, the Encoder creates an embedding with shape (16,16,16) which represents 75% memory compression from the input space. The Decoder simply reconstructs the input from the encoded embedding. The model uses Adam optimization and MSE loss.
Many architectures were tested, adjusting the number of filters, kernel size, and activation type, as well as number of layers (how far the compression could go and still be useful). The addition of a Dense (fully connected) layer was also tested at the 'bottom' of the Encoder, as seen in a number of examples in the research. However, this seemed to make training loss far higher, so the inclusion of this layer was quickly abandoned.
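As a quick sanity check on the compression figures quoted above (which, per the comments in the model cell, assume a float32 embedding measured against a 1-byte-per-pixel input image), the arithmetic works out as:

```python
# verify the compression ratios for a (16, 16, 16) embedding of a 256x256 input
input_values = 256 * 256          # input image pixels (single channel)
embed_values = 16 * 16 * 16       # values in the encoded embedding
value_ratio = embed_values / input_values
print(f'value compression: {1 - value_ratio:.2%}')  # ~93.75% fewer values

# memory: float32 embedding (4 bytes/value) vs an 8-bit image (1 byte/pixel)
embed_bytes = embed_values * 4
input_bytes = input_values * 1
print(f'memory compression: {1 - embed_bytes / input_bytes:.2%}')  # 75.00%
```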
input_img = tf.keras.layers.Input(shape=(input_image_size[0],input_image_size[1], 1))
# Encoder
x = tf.keras.layers.Conv2D(128, (3, 3), activation="relu", padding="same")(input_img)
x = tf.keras.layers.MaxPooling2D((2, 2), padding="same")(x)
x = tf.keras.layers.Conv2D(64, (3, 3), activation="relu", padding="same")(x)
x = tf.keras.layers.MaxPooling2D((2, 2), padding="same")(x)
x = tf.keras.layers.Conv2D(32, (3, 3), activation="relu", padding="same")(x)
x = tf.keras.layers.MaxPooling2D((2, 2), padding="same")(x)
x = tf.keras.layers.Conv2D(16, (3, 3), activation="relu", padding="same")(x)
encoded = tf.keras.layers.MaxPooling2D((2, 2), padding="same")(x)
# encoded.shape = (16,16,16)
# ~94% value compression (16*16*16)/(256*256)
# ~75% memory compression (16*16*16*4 bytes)/(256*256*1 bytes)
# Decoder
x = tf.keras.layers.Conv2D(16, (3, 3), activation='relu', padding='same')(encoded)
x = tf.keras.layers.UpSampling2D((2, 2))(x)
x = tf.keras.layers.Conv2D(32, (3, 3), activation='relu', padding='same')(x)
x = tf.keras.layers.UpSampling2D((2, 2))(x)
x = tf.keras.layers.Conv2D(64, (3, 3), activation='relu', padding='same')(x)
x = tf.keras.layers.UpSampling2D((2, 2))(x)
x = tf.keras.layers.Conv2D(128, (3, 3), activation='relu', padding='same')(x)
x = tf.keras.layers.UpSampling2D((2, 2))(x)
decoded = tf.keras.layers.Conv2D(1, (3, 3), padding='same')(x) # no activation
# Autoencoder
autoencoder = tf.keras.models.Model(input_img, decoded)
autoencoder.compile(optimizer="adam", loss="mean_squared_error")
autoencoder.summary()
2022-10-20 14:04:50.648237: E tensorflow/stream_executor/cuda/cuda_driver.cc:265] failed call to cuInit: CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE: forward compatibility was attempted on non supported HW
2022-10-20 14:04:50.648361: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: pop-os
2022-10-20 14:04:50.648391: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: pop-os
2022-10-20 14:04:50.648690: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 515.65.1
2022-10-20 14:04:50.648757: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 510.85.2
2022-10-20 14:04:50.648775: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:313] kernel version 510.85.2 does not match DSO version 515.65.1 -- cannot find working devices in this configuration
2022-10-20 14:04:50.649404: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 256, 256, 1)] 0
conv2d (Conv2D) (None, 256, 256, 128) 1280
max_pooling2d (MaxPooling2D (None, 128, 128, 128) 0
)
conv2d_1 (Conv2D) (None, 128, 128, 64) 73792
max_pooling2d_1 (MaxPooling (None, 64, 64, 64) 0
2D)
conv2d_2 (Conv2D) (None, 64, 64, 32) 18464
max_pooling2d_2 (MaxPooling (None, 32, 32, 32) 0
2D)
conv2d_3 (Conv2D) (None, 32, 32, 16) 4624
max_pooling2d_3 (MaxPooling (None, 16, 16, 16) 0
2D)
conv2d_4 (Conv2D) (None, 16, 16, 16) 2320
up_sampling2d (UpSampling2D (None, 32, 32, 16) 0
)
conv2d_5 (Conv2D) (None, 32, 32, 32) 4640
up_sampling2d_1 (UpSampling (None, 64, 64, 32) 0
2D)
conv2d_6 (Conv2D) (None, 64, 64, 64) 18496
up_sampling2d_2 (UpSampling (None, 128, 128, 64) 0
2D)
conv2d_7 (Conv2D) (None, 128, 128, 128) 73856
up_sampling2d_3 (UpSampling (None, 256, 256, 128) 0
2D)
conv2d_8 (Conv2D) (None, 256, 256, 1) 1153
=================================================================
Total params: 198,625
Trainable params: 198,625
Non-trainable params: 0
_________________________________________________________________
This model trains relatively quickly, even on CPU. The model is set to train for only 10 epochs; it converges rapidly, and it is not clear that training further would yield significantly better results. Shuffle is set to True, which shuffles the training data at the beginning of each epoch (a mild form of regularization).
Note that X_train is passed as x and y, given this is an autoencoder working with unlabeled data. The inputs are fed to the autoencoder, and the outputs of the autoencoder are compared to the inputs. Keras models also support passing in validation data (X_test in this case), which allows loss evaluation at the end of each epoch – the model is not trained on this data.
autoencoder.fit(X_train, X_train,
epochs=10,
batch_size=32,
shuffle=True,
validation_data=(X_test, X_test))
2022-10-20 14:04:54.848960: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 160432128 exceeds 10% of free system memory.
Epoch 1/10
2022-10-20 14:04:55.221853: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 160432128 exceeds 10% of free system memory.
2022-10-20 14:04:56.212995: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 1073741824 exceeds 10% of free system memory.
2022-10-20 14:04:56.598244: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 268435456 exceeds 10% of free system memory.
2022-10-20 14:04:56.689577: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 134217728 exceeds 10% of free system memory.
20/20 [==============================] - 246s 12s/step - loss: 0.0336 - val_loss: 0.0089
Epoch 2/10
20/20 [==============================] - 180s 9s/step - loss: 0.0077 - val_loss: 0.0066
Epoch 3/10
20/20 [==============================] - 171s 9s/step - loss: 0.0062 - val_loss: 0.0060
Epoch 4/10
20/20 [==============================] - 173s 9s/step - loss: 0.0055 - val_loss: 0.0054
Epoch 5/10
20/20 [==============================] - 171s 9s/step - loss: 0.0051 - val_loss: 0.0051
Epoch 6/10
20/20 [==============================] - 171s 9s/step - loss: 0.0049 - val_loss: 0.0048
Epoch 7/10
20/20 [==============================] - 171s 9s/step - loss: 0.0046 - val_loss: 0.0046
Epoch 8/10
20/20 [==============================] - 171s 9s/step - loss: 0.0044 - val_loss: 0.0045
Epoch 9/10
20/20 [==============================] - 171s 9s/step - loss: 0.0043 - val_loss: 0.0044
Epoch 10/10
20/20 [==============================] - 171s 9s/step - loss: 0.0042 - val_loss: 0.0043
<keras.callbacks.History at 0x7f328828b4f0>
Once trained, we can visually inspect the results of this autoencoder by comparing input images to reconstructed output images. We also visualize the input embeddings; although an embedding itself is not 2-dimensional, some sense of how it represents the input can be ascertained. The reconstructed images reveal the lossy compression of the embeddings, readily seen in the 'fuzziness' or 'blurriness' of the reconstruction with respect to the original input images. However, the features of the input images have been reconstructed in the output, meaning the embeddings capture these features quite well.
n = 4
encoder = tf.keras.models.Model(input_img, encoded)
encoded_imgs = encoder.predict(X_test[:n+1])
decoded_imgs = autoencoder.predict(X_test[:n+1])
plt.figure(figsize=(16, 12))
for i in range(1, n + 1):
    # display original
    ax = plt.subplot(3, n, i)
    plt.imshow(X_test[i].reshape(input_image_size[1], input_image_size[0]), aspect='auto')
    plt.gray()
    plt.xticks([])
    plt.yticks([])
    ax.set_title(f'Filename: ...{Y_test[i][-20:-4]}')
    ax.set_ylabel('Input')
    # display embeddings
    ax = plt.subplot(3, n, i + n)
    plt.imshow(encoded_imgs[i].reshape(64, 64), aspect='auto')
    plt.gray()
    plt.xticks([])
    plt.yticks([])
    ax.set_ylabel('Embedding')
    # display reconstruction
    ax = plt.subplot(3, n, i + n + n)
    plt.imshow(decoded_imgs[i].reshape(input_image_size[1], input_image_size[0]), aspect='auto')
    plt.gray()
    plt.xticks([])
    plt.yticks([])
    ax.set_ylabel('Reconstruction')
plt.tight_layout()
plt.show()
1/1 [==============================] - 0s 206ms/step
1/1 [==============================] - 3s 3s/step
PCA (Principal Component Analysis) projection can give a sense of the embedding space and how the embeddings map with respect to each other. Below, this is visualized with 2 components (2 dimensions). Using the PCA projection (with 3-9 components) of the embeddings in similarity search (taking cosine similarity of the projection) was also tested, but this seemed to perform more poorly than simply using the cosine similarity of the embeddings directly.
num_components = 2
encoded_imgs = encoder.predict(X_test)
embeddings = [np.ravel(e) for e in encoded_imgs]
pca = PCA(n_components=num_components)
pca.fit(embeddings)
pca_proj = pca.transform(embeddings)
print(pca.explained_variance_ratio_)
fig = plt.figure()
ax = fig.add_subplot()
for point in pca_proj:  # renamed loop variable to avoid shadowing the fitted PCA object
    ax.scatter(point[0], point[1])
plt.show()
5/5 [==============================] - 3s 540ms/step
[0.51310923 0.16606748]
This step isolates the trained encoder and saves the model to be used separately.
model = tf.keras.models.Model(inputs=autoencoder.inputs, outputs=autoencoder.layers[8].output)
model.summary()
model.save(trained_encoder_location)
WARNING:tensorflow:Compiled the loaded model, but the compiled metrics have yet to be built. `model.compile_metrics` will be empty until you train or evaluate the model.
WARNING:absl:Found untraced functions such as _jit_compiled_convolution_op, _jit_compiled_convolution_op, _jit_compiled_convolution_op, _jit_compiled_convolution_op while saving (showing 4 of 4). These functions will not be directly callable after loading.
INFO:tensorflow:Assets written to: trained_encoders/ConvEncoder/assets
This simply tests loading the encoder model and creating and visualizing embeddings for the test data.
# Load the encoder and time inference on the first n inputs from X_test
n = 10
encoder = keras.models.load_model(trained_encoder_location)
%timeit encoder.predict(X_test[:n])
encoded_imgs = encoder.predict(X_test[:n])  # %timeit does not keep assignments, so predict again
plt.figure(figsize=(20, 8))
for i in range(n):
    ax = plt.subplot(1, n, i + 1)
    plt.imshow(encoded_imgs[i].reshape((64, 64)))
    plt.gray()
    ax.set_title(Y_test[i][-16:-4])
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
plt.show()
WARNING:tensorflow:No training configuration found in save file, so the model was *not* compiled. Compile it manually.
1/1 [==============================] - 0s 236ms/step
1/1 [==============================] - 0s 194ms/step
1/1 [==============================] - 0s 186ms/step
1/1 [==============================] - 0s 201ms/step
1/1 [==============================] - 0s 190ms/step
1/1 [==============================] - 0s 196ms/step
1/1 [==============================] - 0s 196ms/step
1/1 [==============================] - 0s 213ms/step
226 ms ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
K-means clustering was intended to be used as a part of the similarity search, however, it appears that the varying cluster density and spread makes this challenging. Perhaps a much larger dataset would make this less of an issue, however this approach was abandoned in favor of using the embeddings directly with a Cosine Similarity matrix, as seen below.
clusters = [4, 8, 16, 32]
scores = []
embeddings = [np.ravel(e) for e in encoded_imgs]
for k in clusters:
    kmeans = KMeans(n_clusters=k, random_state=42).fit(embeddings)
    score = silhouette_score(embeddings, kmeans.labels_)
    print(f'k: {k}, silhouette score: {score}')
    scores.append(score)
k = clusters[np.argmax(scores)]
print(f'Optimal k: {k}')
kmeans = KMeans(n_clusters=k, random_state=42).fit(embeddings)
k: 4, silhouette score: 0.20878012478351593
k: 8, silhouette score: 0.196126326918602
k: 16, silhouette score: 0.17918486893177032
k: 32, silhouette score: 0.18431542813777924
Optimal k: 4
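Since the search below rests on cosine similarity of the flattened embeddings, it may help to see the computation spelled out in plain NumPy. This sketch is equivalent in spirit to sklearn's `cosine_similarity`, using made-up toy vectors in place of real embeddings:

```python
import numpy as np

def cosine_similarity_np(X):
    """Pairwise cosine similarity: normalize rows to unit length, then dot."""
    X = np.asarray(X, dtype=np.float64)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_unit = X / norms
    return X_unit @ X_unit.T

emb = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])  # toy 'embeddings'
sim = cosine_similarity_np(emb)
print(np.round(sim, 3))  # symmetric, with 1.0 on the diagonal
```

Because each row is normalized, the result depends only on the angle between embedding vectors, not their magnitudes.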
Below, the encoder is loaded and used to create embeddings for the entire audio dataset. A cosine similarity matrix is then created directly from the embeddings and cached in a 'similarity database' dictionary keyed by audio file path. The get_similar_audio function can then directly retrieve the n_results most similar audio files from the database based on the highest similarity values for a given audio file.
# load model and get audio data
encoder = keras.models.load_model(trained_encoder_location)
all_paths, data = get_data(inference_clip_paths)
data = np.reshape(data, (len(data), input_image_size[0], input_image_size[1], 1))
WARNING:tensorflow:No training configuration found in save file, so the model was *not* compiled. Compile it manually.
WARNING:tensorflow:No training configuration found in save file, so the model was *not* compiled. Compile it manually.
# get encodings and create similarity (cosine) database
start = time.time()
# create embeddings for all input data
encoded_data = encoder.predict(data)
encoding_time = round(time.time()-start, 2)
print(f'encoding time: {encoding_time:.2f}s, {encoding_time/len(data):.3f}s per audio file.')
encoded_data = [np.ravel(e) for e in encoded_data]
similarity_matrix = cosine_similarity(encoded_data)
24/24 [==============================] - 15s 596ms/step
encoding time: 14.70s, 0.019s per audio file.
# create similarity database
similarity_database = {}
for P, S in zip(all_paths, similarity_matrix):
    similarity_database[P] = S
def get_similar_audio(path, similarity_database, n_results):
    """
    Retrieves the n_results most similar audio files to the given audio file at path from the given similarity_database.
    """
    n_results += 1  # fetch one extra result, since the query file will match itself
    all_paths = list(similarity_database.keys())
    ipd.display(ipd.Audio(path))
    # get similarity results for this audio
    similar = similarity_database[path]
    # get indexes of the n_results highest values
    result_indexes = np.argpartition(similar, -n_results)[-n_results:]
    # build a dictionary of paths and similarity scores from the results
    results = {k: v for (k, v) in zip([all_paths[x] for x in result_indexes], [similar[x] for x in result_indexes])}
    # eliminate self from results
    results = {k: v for k, v in results.items() if k != path}
    # get result keys sorted by value, descending
    sorted_keys = sorted(results, key=results.__getitem__, reverse=True)
    # show results
    for count, k in enumerate(sorted_keys, start=1):
        print(f'Result {count}: {k}, Similarity: {results[k]:.3f}')
        ipd.display(ipd.Audio(k))
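The top-result selection above relies on np.argpartition, which finds the indexes of the k largest values without fully sorting the array. A small illustration with toy similarity scores:

```python
import numpy as np

scores = np.array([0.2, 0.9, 0.1, 0.8, 0.5])  # made-up similarity values
k = 3
# indexes of the k largest scores, returned in arbitrary order
top_idx = np.argpartition(scores, -k)[-k:]
# sort those k by score, descending, as the search does with its result dict
top_idx = top_idx[np.argsort(scores[top_idx])[::-1]]
print(top_idx, scores[top_idx])
```

For a large database this is cheaper than sorting the full similarity row, since only the k retained entries ever need ordering.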
Below, the results from similarity search are demonstrated. For n_examples we select a random audio file from the database and return the n_results most similar audio files. Using the inline audio players, one can audition the original audio file and the returned results. Running this cell multiple times will produce new examples each time.
# test returning results and check playback
n_examples = 5
n_results = 4
for i in range(n_examples):
    path = all_paths[random.randrange(len(all_paths))]  # select a random path from all_paths
    print(f'Finding similar sounds to {path}')
    get_similar_audio(path, similarity_database, n_results)
    print('\n\n-----------------------------------------------------------\n\n')
Finding similar sounds to audio-data/Keys/573628__acollier123__preset-jazz-organ-c.wav
Result 1: audio-data/Keys/573639__acollier123__preset-pearl-drop-c.wav, Similarity: 0.950
Result 2: audio-data/Keys/166009__acollier123__casio-hz600-01-piano-c.wav, Similarity: 0.929
Result 3: audio-data/Bass/110518__nandoo1__nandoo-messany-horror-bell.wav, Similarity: 0.925
Result 4: audio-data/Keys/573633__acollier123__preset-synphonic-ens-c.wav, Similarity: 0.925
-----------------------------------------------------------

Finding similar sounds to audio-data/Bass/320046__staticpony1__analog-bass-vel-2.wav
Result 1: audio-data/Bass/320047__staticpony1__analog-bass-vel-1.wav, Similarity: 0.997
Result 2: audio-data/Drums/25641__walter-odington__hot-rod-kick.wav, Similarity: 0.978
Result 3: audio-data/Drums/25642__walter-odington__krusty-kick.wav, Similarity: 0.977
Result 4: audio-data/Drums/25650__walter-odington__super-pulse-kick.wav, Similarity: 0.976
-----------------------------------------------------------

Finding similar sounds to audio-data/Drums/183096__dwsd__bd-dust808.wav
Result 1: audio-data/Bass/331480__staticpony1__analog-bass-vel-3.wav, Similarity: 0.957
Result 2: audio-data/Bass/331485__staticpony1__analog-bass-vel-6.wav, Similarity: 0.954
Result 3: audio-data/Drums/183115__dwsd__prc-appet909.wav, Similarity: 0.953
Result 4: audio-data/Drums/183124__dwsd__prc-dust808tomlow.wav, Similarity: 0.946
-----------------------------------------------------------

Finding similar sounds to audio-data/Keys/573643__acollier123__preset-synth-celesta-c.wav
Result 1: audio-data/Keys/573627__acollier123__preset-brass-ensemble.wav, Similarity: 0.987
Result 2: audio-data/Keys/573640__acollier123__preset-synth-clavi-c.wav, Similarity: 0.970
Result 3: audio-data/Keys/573629__acollier123__preset-harpsichord-c.wav, Similarity: 0.969
Result 4: audio-data/Keys/573630__acollier123__preset-blues-harmonica-c.wav, Similarity: 0.960
-----------------------------------------------------------

Finding similar sounds to audio-data/Drums/270137__theriavirra__02-ride-long-cymbals-snares.wav
Result 1: audio-data/Drums/269967__theriavirra__01-snare-aftershot-smooth-cymbals-snares.wav, Similarity: 0.980
Result 2: audio-data/Drums/270100__theriavirra__snare-sample-2015-10-american-punch.wav, Similarity: 0.978
Result 3: audio-data/Drums/269904__theriavirra__01b-snare-smooth-cymbals-snares.wav, Similarity: 0.977
Result 4: audio-data/Drums/270158__theriavirra__04-snare-smooth-cymbals-snares.wav, Similarity: 0.976
-----------------------------------------------------------
The results demonstrated above seem quite good. The resulting encoder model and similarity search based on a cosine similarity matrix would be eminently useful in the context of an audio sample browser or audio production application, where producers frequently need to quickly find audio similar to a given sample. There are, however, a few returned results which do not match the searched audio very well.
A more sophisticated approach, taking into account multiple discrete aspects of auditory perception (using different audio features with more models in an ensemble approach), may lead to more robust results. Further research points to possibly using a convolutional LSTM model that could explicitly model the relationship of frequency over time, although the simpler convolutional approach above does capture a sense of time, given that the input images' x-axis (width) does, in fact, represent time. Simpler features could be accounted for relatively easily as well, such as RMS loudness, average pitch class (key centers vs. atonal), and much more.
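For instance, one such simpler feature, RMS loudness, takes only a few lines of NumPy. This sketch uses a synthetic sine wave rather than the dataset's audio; librosa.feature.rms offers a framed equivalent:

```python
import numpy as np

def rms_loudness(y):
    """Root-mean-square level of an audio signal."""
    return float(np.sqrt(np.mean(np.square(y))))

# synthetic 1 kHz sine at 22050 Hz; the RMS of a full-scale sine is 1/sqrt(2) ≈ 0.707
sr = 22050
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 1000 * t)
print(round(rms_loudness(y), 3))  # 0.707
```

A scalar like this could be concatenated to the flattened embeddings (after normalization) before computing cosine similarity.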
This was a very interesting project, and I look forward to continuing research in this direction.